Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Dec 12 09:30:02 2023"

Initial Thoughts

I have some experience with R and other languages like Python. I find that the most challenging aspect is getting used to the logic of how RStudio and Git interact (i.e. getting the changes sent to my repository and getting the diary to update)

I went over Exercise 1 briefly and found its length overwhelming; I am not sure whether we were expected to go through all of it (if so, I don't think I will be able to keep up).

Nevertheless, I am still up for the challenge and will just try and absorb as much as I can.

Here is the link to my repository: https://github.com/nicolasyeung/IODS-project


Regression and model validation

This week I learned quite a bit about data wrangling and performing some basic linear regression.

date()
## [1] "Tue Dec 12 09:30:02 2023"

Creating an environment with the needed packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Reading the data in and checking its structure

students2014 <- read.csv("C:/Users/nicol/Downloads/PHD_302/IODS-project/learning2014.csv")
str(students2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

Creating a graphical overview of the data

ggpairs(students2014, columns = 3:8)

Here I can see that most of the scores are roughly normally distributed, with no strong correlations that I can spot by eye.
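To back up the visual impression with numbers, cor() returns the same pairwise correlations that ggpairs() prints in its upper triangle. A minimal sketch on simulated columns (the students2014 file is local to my machine, so the data here is made up):

```r
# Sketch: numeric counterpart to the ggpairs() correlation panel,
# on three simulated (uncorrelated) columns.
set.seed(4)
d <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
round(cor(d), 2)  # off-diagonal values near 0 match "no real correlations"
```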

Selecting variables to test a regression model

first_model <- lm(formula= Points ~ attitude + deep  + Age  + surf + stra, data=students2014)
summary(first_model)
## 
## Call:
## lm(formula = Points ~ attitude + deep + Age + surf + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4759  -3.2812   0.3804   3.5381  11.4897 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.30059    5.24076   3.492  0.00062 ***
## attitude     3.43200    0.56991   6.022 1.14e-08 ***
## deep        -1.04374    0.78179  -1.335  0.18375    
## Age         -0.09650    0.05335  -1.809  0.07234 .  
## surf        -1.10529    0.84009  -1.316  0.19016    
## stra         0.96551    0.53917   1.791  0.07523 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.249 on 160 degrees of freedom
## Multiple R-squared:  0.2311, Adjusted R-squared:  0.2071 
## F-statistic:  9.62 on 5 and 160 DF,  p-value: 4.803e-08

I honestly do not know enough about statistics to say much meaningful about this model. Attitude seems to have the biggest impact on Points, while Age seems to have little effect, so I will remove it and see what happens.

second_model <- lm(formula= Points ~ attitude + deep  + surf + stra, data=students2014)
summary(second_model)
## 
## Call:
## lm(formula = Points ~ attitude + deep + surf + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7335  -3.1992   0.7975   3.4002  11.6582 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.4000     5.0245   3.065  0.00255 ** 
## attitude      3.4362     0.5739   5.987 1.34e-08 ***
## deep         -1.0072     0.7870  -1.280  0.20247    
## surf         -0.9105     0.8390  -1.085  0.27948    
## stra          0.8847     0.5411   1.635  0.10398    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.286 on 161 degrees of freedom
## Multiple R-squared:  0.2154, Adjusted R-squared:  0.1959 
## F-statistic: 11.05 on 4 and 161 DF,  p-value: 6.057e-08

Here it seems that the coefficient of attitude increased slightly when I took out Age, while the multiple R-squared went down. My understanding of statistics is limited, but since multiple R-squared (which ranges between 0 and 1, with 1 being a perfect fit) decreased, the model apparently fit better with Age included.
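A more formal way to compare the two fits is an F-test between the nested models with anova(), alongside the adjusted R-squared, which penalizes extra predictors. A sketch on simulated data (the real learning2014 file is local to my machine):

```r
# Sketch: comparing nested linear models, on simulated data.
set.seed(1)
n <- 166
attitude <- rnorm(n, 3.4, 0.7)
Age <- rpois(n, 25)
Points <- 11 + 3.5 * attitude - 0.1 * Age + rnorm(n, 0, 5)
d <- data.frame(Points, attitude, Age)

full    <- lm(Points ~ attitude + Age, data = d)
reduced <- lm(Points ~ attitude, data = d)

# Multiple R-squared can never increase when a predictor is dropped;
# adjusted R-squared corrects for model size, so it is the fairer comparison.
summary(full)$adj.r.squared
summary(reduced)$adj.r.squared

# The F-test asks whether the extra predictor improves the fit significantly.
anova(reduced, full)
```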

Creating Diagnostic Plots

plot(first_model, which = 1, caption = "Residuals vs Fitted")

plot(first_model, which = 2)

plot(first_model, which = 5)

From the online textbook I see that linear regression models rest on four assumptions: a linear relationship between the predictors and the outcome, independence of the errors, normally distributed errors, and equal variance of the errors (homoscedasticity).

- Looking at the Residuals vs Fitted plot, the residuals are fairly evenly scattered around zero across the fitted values, which suggests the linearity and equal-variance assumptions are reasonable.

- The Q-Q plot also shows the residuals following the reference line fairly well, i.e. an approximately normal distribution, which is another good check of the model.

- The Residuals vs Leverage plot suggests that few observations have a large influence on the fit. There are a few outliers that might merit a closer look.
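The three diagnostic plots above can also be drawn together in one 2x2 panel, which makes them easier to scan side by side. A sketch on a simulated model (students2014 is local to my machine):

```r
# Sketch: the standard lm() diagnostic plots in a single 2x2 panel,
# using a simulated regression.
set.seed(2)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

op <- par(mfrow = c(2, 2))  # 2x2 plotting grid
plot(fit)                   # Residuals vs Fitted, Q-Q, Scale-Location, Leverage
par(op)                     # restore the previous graphics settings
```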


Logistic Regression

Preparing environment

library(tidyverse)
library(dplyr)
library(ggplot2)

Assignment task 2 - reading the csv file I wrangled

alc <- read.csv("C:/Users/nicol/Downloads/PHD_302/IODS-project/data/Assignement3_Logistic_Regression/alc.csv")

Assignment task 3 - creating hypotheses

I think that alcohol consumption will correlate in the following ways:

1) famrel: if family relationships are better (higher famrel), alc_use will be lower
2) goout will correlate positively with alc_use: more time going out means more time for alcohol consumption
3) failures will correlate with alc_use: one reason for poor academic performance could be higher alcohol consumption
4) higher being "no" (no aspiration for higher education) will correlate with higher alcohol consumption

Assignment task 4 - Exploring the data

g1 <- ggplot(data = alc, aes(x = high_use, y = famrel))
g1 + geom_boxplot() + ylab("famrel")

Interesting, there seems to be a trend here: better family relationships do correlate with lower alcohol consumption, but it seems weak. My prediction is that if I use famrel as a variable in my logistic regression it will not add much explanatory power.

g2 <- ggplot(data = alc, aes(x = high_use, y = goout))
g2 + geom_boxplot() + ylab("goout")

Here I see my guess confirmed: people who go out more also drink more. By eye, this looks slightly stronger than famrel; I would expect it to correlate with high_use being true.

ggplot(data=alc, aes(x=factor(failures),y=alc_use))+
  geom_boxplot()+
  geom_hline(yintercept = 2, linetype="dashed", colour="red")

Looking at the distribution of failures vs alc_use relative to the threshold of two (dashed line), the students who have failed one or more courses tend to fall into the high_use range.

g3 <- ggplot(data = alc, aes(x = high_use, y = higher))
g3 + geom_count(aes(high_use, higher, alpha = 0.7, colour = higher))

Here I had a hard time finding a plot that would show the comparison I wanted; one of my learning pains with data visualization is finding the right plot type. But here I can see that among the people who don't aspire to higher education there isn't much of a difference in high_use. Among those who do (who also make up the majority of the students), more students fall outside high_use than inside it, which corresponds to my guess, though perhaps not strongly.

g4 <- ggplot(data = alc, aes(x = higher, y = G3))
g4 + geom_boxplot() + ylab("G3")

I was interested to see whether the ambition for higher education also correlated with grades; it seems to. Interesting.

Assignment task 5 - logistic regression

m <- glm(high_use ~ famrel + failures + goout + higher, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + failures + goout + higher, 
##     family = "binomial", data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.4620     0.8795  -1.662   0.0964 .  
## famrel       -0.3747     0.1364  -2.748   0.0060 ** 
## failures      0.4698     0.2283   2.058   0.0396 *  
## goout         0.7576     0.1199   6.319 2.63e-10 ***
## higheryes    -0.5394     0.6271  -0.860   0.3897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 387.67  on 365  degrees of freedom
## AIC: 397.67
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)      famrel    failures       goout   higheryes 
##  -1.4620255  -0.3747109   0.4698436   0.7575575  -0.5393577
Here I can see that goout correlates the most with high_use, followed by famrel and failures. higheryes does correlate with lower alcohol use, but not significantly (p = 0.3897).
OR <- coef(m) %>% exp
OR
## (Intercept)      famrel    failures       goout   higheryes 
##   0.2317664   0.6874880   1.5997439   2.1330599   0.5831227
The odds ratios help me see the impact of goout. I was confused by the coefficient of 0.7576, but I forgot I was looking at a logistic regression: coefficients are on the log-odds scale. The odds ratio of 2.13 fits the high z value: each one-unit increase in goout roughly doubles (a ~113% increase in) the odds of high_use.
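The log-odds to odds-ratio conversion can be checked by hand; this sketch reuses the goout coefficient printed above:

```r
# Sketch: a logistic-regression coefficient b lives on the log-odds scale;
# exp(b) is the multiplicative change in the ODDS per 1-unit predictor increase.
b_goout  <- 0.7575575      # goout coefficient from the model output above
or_goout <- exp(b_goout)   # ~2.13, matching the OR table
or_goout
(or_goout - 1) * 100       # ~113% increase in the odds (not in the probability)
```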
CI <-  confint(m) %>% exp
## Waiting for profiling to be done...
CI
##                  2.5 %    97.5 %
## (Intercept) 0.03965521 1.2706808
## famrel      0.52457434 0.8971249
## failures    1.02980361 2.5365490
## goout       1.69686639 2.7177864
## higheryes   0.16737211 2.0213349
My knowledge of statistics is lacking, but since the 95% confidence intervals for famrel, failures and goout do not include 1, their associations with high_use look reliable; the interval for higheryes spans 1, so that effect is uncertain.

Assignment task 6 - Predictive power of my model

m <- glm(high_use ~ famrel + failures + goout, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
colnames(alc)
##  [1] "X"           "school"      "sex"         "age"         "address"    
##  [6] "famsize"     "Pstatus"     "Medu"        "Fedu"        "Mjob"       
## [11] "Fjob"        "reason"      "guardian"    "traveltime"  "studytime"  
## [16] "schoolsup"   "famsup"      "activities"  "nursery"     "higher"     
## [21] "internet"    "romantic"    "famrel"      "freetime"    "goout"      
## [26] "Dalc"        "Walc"        "health"      "failures"    "paid"       
## [31] "absences"    "G1"          "G2"          "G3"          "alc_use"    
## [36] "high_use"    "probability" "prediction"
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   238   21
##    TRUE     75   36
Here I can see that my model produced 238 true negatives (FALSE-FALSE) and 36 true positives (TRUE-TRUE). It had 75 false negatives (high_use was TRUE but the prediction was FALSE) and 21 false positives (the inverse).
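The counts in the 2x2 table also give the usual summary metrics; this sketch just copies the counts from the output above:

```r
# Sketch: accuracy, sensitivity and specificity from the confusion matrix,
# using the counts printed in the cross-tabulation above.
tn <- 238; fp <- 21; fn <- 75; tp <- 36
accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of true high_use cases caught
specificity <- tn / (tn + fp)   # share of non-high_use cases correctly cleared
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```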
ggplot(alc, aes(x = probability, y = high_use, col = prediction))+
  geom_point()

Taking a look at the data visually.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2594595
Here I see that the proportion of wrong predictions in my model is ~26%.
library(boot)

# leave-one-out cross-validation (K = number of rows)
cv_train <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))

# 10-fold cross-validation
cv_test <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)


cv_train$delta[1]
## [1] 0.2594595
cv_test$delta[1]
## [1] 0.2702703
Here I see that leave-one-out cross-validation gave an error rate of about 25.9% and 10-fold cross-validation about 27.0%, so my model does somewhat better than guessing.
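To make the difference between the two cv.glm() calls explicit: K equal to the number of rows is leave-one-out cross-validation, while K = 10 is 10-fold. A self-contained sketch on simulated binary data (the real alc file is local to my machine):

```r
library(boot)

# Sketch: leave-one-out vs 10-fold CV for a logistic model, on simulated data.
set.seed(3)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x))   # binary outcome driven by x
d <- data.frame(x, y)
fit <- glm(y ~ x, data = d, family = "binomial")

# same loss function as above: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

cv_loo <- cv.glm(d, cost = loss_func, glmfit = fit, K = n)   # K = n: leave-one-out
cv_10  <- cv.glm(d, cost = loss_func, glmfit = fit, K = 10)  # 10-fold
c(loo = cv_loo$delta[1], ten_fold = cv_10$delta[1])
```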

Clustering and Classification

Preparing environment

library(tidyverse)
library(dplyr)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
date()
## [1] "Tue Dec 12 09:30:11 2023"

Task 2 - Loading and exploring the MASS data package

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
The dataframe describes 506 towns in the Boston area across 14 variables.
library(reshape)
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:lubridate':
## 
##     stamp
## The following object is masked from 'package:dplyr':
## 
##     rename
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
Boston_melted <- melt(Boston)
## Using  as id variables
ggplot(Boston_melted, aes(x = value)) +
  geom_histogram(binwidth = 1, fill = "turquoise", color = "blue") +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()


Here, looking at the data as distributions, we can begin to see some issues: the variables black, crim and age seem to be skewed. Most of the houses seem to have 6 rooms, which is quite lovely.


Task 3 - Graphical overview of data and pairwise correlations

cor_matrix <- cor(Boston) 
corrplot(cor_matrix, method="circle")

Here I can see there are some strong negative correlations: dis vs (indus, nox, age) and medv vs lstat. There are also some strong positive correlations: medv vs rm (which makes sense, as more rooms ideally mean a higher property value) and indus vs nox (more industry means higher relative air-pollution levels).


In looking at this data set I am really interested, at least at first, in what crime correlates with. Here I can see that crim positively correlates with rad and tax and negatively correlates with dis, black and medv. I could interpret this as crime concentrating in areas with high highway access and tax rates, and the higher the median value of the properties in a town, the lower the crime rate, which also makes sense.


Task 4 - Standardize the data prior to LDA

Scaling the Boston data
Boston_scaled <- as.data.frame(scale(Boston))
summary(Boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
bins <- quantile(Boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(Boston_scaled$crim, breaks = bins,labels = c("low","med_low","med_high","high"), include.lowest = TRUE)

# remove original crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)

# add the new categorical value to scaled data
Boston_scaled <- data.frame(Boston_scaled, crime)

summary(Boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
Just checking that the code worked: I now have a categorical crime variable, and crim has been dropped.
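The quantile() + cut() combination used above can be seen on a toy vector: the quantile breaks split the values into four roughly equal-sized categories.

```r
# Sketch: quantile breaks + cut() turn a numeric vector into four
# equal-sized categories, on a toy vector of eight values.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(x)                      # min, Q1, median, Q3, max
f <- cut(x, breaks = bins,
         labels = c("low", "med_low", "med_high", "high"),
         include.lowest = TRUE)          # keeps the minimum in the first bin
table(f)                                 # two values per category
```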
# Test and train data sets:
n <- nrow(Boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- Boston_scaled[ind,]

# create test set 
test <- Boston_scaled[-ind,]
Here we prepare the data into test and train sets. We randomly choose 80 percent of the row indices into the vector ind, which selects the training set; the remaining rows (-ind) form the test set.
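The split can be sketched on a toy data frame; sample() returns row indices (a plain integer vector, not a data frame), and negative indexing picks the complement:

```r
# Sketch: an 80/20 train/test split with sample(), on a toy data frame.
set.seed(42)
d <- data.frame(id = 1:10)
n <- nrow(d)
ind <- sample(n, size = n * 0.8)         # 8 random row indices
train <- d[ind, , drop = FALSE]
test  <- d[-ind, , drop = FALSE]         # the rows NOT sampled
nrow(train); nrow(test)                  # 8 and 2
```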


Task 5 - Fitting Linear Discriminant Analysis

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2623762 0.2400990 0.2425743 
## 
## Group means:
##                  zn       indus        chas        nox         rm        age
## low       0.9980571 -0.87283318 -0.11943197 -0.8813773  0.4425432 -0.8956595
## med_low  -0.1116607 -0.27780426  0.02481057 -0.5566847 -0.1260818 -0.3161894
## med_high -0.3617030  0.08304473  0.21473488  0.3081439  0.2036636  0.3744945
## high     -0.4872402  1.01719597 -0.03128211  1.0559943 -0.4875673  0.8496591
##                 dis        rad        tax     ptratio      black      lstat
## low       0.8846150 -0.6930819 -0.7408369 -0.42432495  0.3816945 -0.7640287
## med_low   0.3137358 -0.5582386 -0.4791587 -0.05048958  0.3109138 -0.1197099
## med_high -0.3562852 -0.4632851 -0.3920947 -0.34708027  0.1822619 -0.1008790
## high     -0.8689569  1.6373367  1.5134896  0.77985517 -0.8076996  0.9163587
##                 medv
## low       0.53543759
## med_low   0.02576952
## med_high  0.30479987
## high     -0.66349081
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08035577  0.76647426 -0.91688915
## indus    0.12203414 -0.02494986  0.21721940
## chas    -0.11853479 -0.08864677  0.09960958
## nox      0.37619028 -0.82468321 -1.44815210
## rm      -0.16109086 -0.14263647 -0.16658789
## age      0.12738910 -0.31997645 -0.10063139
## dis     -0.01624393 -0.21673264  0.08436810
## rad      4.07653655  0.98736116 -0.18293581
## tax      0.01553463 -0.11315137  0.72266407
## ptratio  0.09488367  0.05619392 -0.23704951
## black   -0.10606554  0.05492863  0.12811500
## lstat    0.22005725 -0.23167364  0.37079059
## medv     0.22928252 -0.34389117 -0.14067764
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9642 0.0265 0.0092
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda (bi)plot
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Task 6 - Testing the LDA for how well it could predict crime in Boston towns

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      11        1    0
##   med_low    4      15        1    0
##   med_high   0       9       16    4
##   high       0       0        0   29

Looking at the model, it performs well on the high crime category, but low, med_low and med_high are off somewhat; med_low is the next best predicted crime level.

Task 7 - K-means clustering

data("Boston")
Boston_scaled <- as.data.frame(scale(Boston))
Scaling the original data.
dist_eu <- dist(Boston_scaled, method = "euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
Measuring the Euclidean distances between observations.
#dist_man <- dist(Boston_scaled, method = "manhattan")
#dist_man
Using method = "manhattan" would show the absolute (city-block) distances between the observations.
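The difference between the two metrics can be seen on two toy points:

```r
# Sketch: Euclidean vs Manhattan distance between two toy points.
m <- rbind(a = c(0, 0), b = c(3, 4))
dist(m, method = "euclidean")  # sqrt(3^2 + 4^2) = 5 (straight line)
dist(m, method = "manhattan")  # |3| + |4| = 7 (city-block path)
```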
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Looking at the elbow, it appears that 2 is the optimal number of clusters for k-means.

# k-means clustering
km <- kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

I am not yet sure how to interpret this pairs plot.

Bonus

#Data
set.seed(32)
data("Boston")
boston_scaled <- scale(Boston) %>% as.data.frame()

#Set eight clusters:
km_bonus <- kmeans(Boston, centers = 8)
boston_scaled$cluster <- km_bonus$cluster

Boston_bonus <- lda(cluster ~ ., data = boston_scaled)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(boston_scaled$cluster)

# plot the lda (bi)plot
plot(Boston_bonus, dimen =2)
lda.arrows(Boston_bonus, myscale = 1)

Boston_bonus
## Call:
## lda(cluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##          1          2          3          4          5          6          7 
## 0.27075099 0.08695652 0.15217391 0.12252964 0.10869565 0.05335968 0.18379447 
##          8 
## 0.02173913 
## 
## Group means:
##         crim         zn      indus        chas        nox          rm
## 1  1.0097765 -0.4872402  1.0662784 -0.04242540  0.9959393 -0.39626518
## 2 -0.4112593  1.8232528 -1.0587789 -0.27232907 -1.2032247  0.32608884
## 3 -0.4103240  0.4669153 -0.6939781  0.08558914 -0.7573615  0.40725570
## 4 -0.3115695 -0.4872402  0.6973678  0.10868063  0.6498658 -0.24065169
## 5 -0.3906267 -0.1882696 -0.4589025 -0.05757815 -0.6526048  0.01656427
## 6 -0.4105500  0.7276120 -0.7660389 -0.27232907 -0.8389328  0.12107492
## 7 -0.3726011 -0.2488802 -0.5624188  0.15101504 -0.2307439  0.26980542
## 8 -0.1918628 -0.4872402  0.8121161  0.08558914  1.3206359 -0.52452958
##          age         dis        rad         tax    ptratio      black
## 1  0.7599946 -0.82659650  1.5757732  1.53915759  0.8040926 -0.7189340
## 2 -1.4995727  1.76550267 -0.6321108 -0.51640025 -0.7080115  0.3436424
## 3 -0.6511612  0.50825445 -0.6835681 -1.09300067 -0.2158123  0.3816348
## 4  0.7665217 -0.74688501 -0.5298939  0.01868990 -0.2268036  0.2441186
## 5 -1.0126162  0.40445826 -0.5725993 -0.71589771 -0.1533051  0.3386187
## 6 -1.1910518  0.90003854 -0.5735274 -0.05678564 -0.3404312  0.3375271
## 7  0.4650140  0.04164319 -0.5323637 -0.67571060 -0.2506439  0.3386363
## 8  0.8257272 -0.69874374 -0.5538062 -0.12654817 -0.6723188 -1.8525431
##        lstat       medv
## 1  0.8432167 -0.6807081
## 2 -0.8161977  0.3320129
## 3 -0.5638615  0.6660396
## 4  0.2594944 -0.1294832
## 5 -0.5252045  0.2809099
## 6 -0.5717207  0.2900036
## 7 -0.1610779  0.2011491
## 8  0.6385135 -0.5996044
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3         LD4          LD5
## crim     0.009913887 -0.137609161 -0.04036563  0.06710828  0.008706277
## zn      -0.063869885 -0.359539354 -0.03252327 -0.37386866 -0.985470185
## indus   -0.064553782 -0.195044951 -0.33201434  0.87866138 -0.853101757
## chas     0.051357463  0.035927002 -0.01879004 -0.06102304  0.054091077
## nox     -0.064270060  0.327990930  0.43933045  0.36992087 -0.682424547
## rm      -0.031831393  0.027532509 -0.07621914 -0.14238990  0.106374027
## age     -0.470229341  1.450265293 -0.23689536 -1.46202224 -0.271967600
## dis      0.361412866 -0.313512208  0.20731668 -0.62537375 -0.395082557
## rad     -1.643884814 -0.791845574 -3.25137506 -0.13663787 -0.481829789
## tax     -9.099980888 -0.327782726  3.36875714 -0.64786007  1.410249515
## ptratio -0.008405428  0.060541663 -0.29843909  0.19223768 -0.102382405
## black    0.023126886 -0.089392096 -0.06586979 -0.19721961  0.616762841
## lstat   -0.032547271 -0.183432824 -0.09333305  0.15316806  0.091791088
## medv    -0.062060374  0.005458806 -0.02837714  0.34842936 -0.056421542
##                  LD6         LD7
## crim     0.098841166  0.03527449
## zn       0.842438878  0.35861794
## indus    0.904577426 -0.95614725
## chas    -0.048494403  0.02316471
## nox     -0.321696150  0.19808743
## rm       0.079509917 -0.04629482
## age      0.130244689 -0.20378571
## dis     -0.476692936 -0.65750487
## rad      0.001318226 -1.00217642
## tax     -0.281439177  0.85946009
## ptratio  0.336896632  0.79163091
## black    0.782771708 -0.43714271
## lstat    0.104232162  0.73179612
## medv     0.242827431  1.10579041
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5    LD6    LD7 
## 0.9703 0.0155 0.0084 0.0029 0.0018 0.0009 0.0002
It seems that tax and rad have a high influence on crime, followed by age.
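To back up this reading, a quick sketch (reusing the `lda.fit` object fitted earlier in this chapter) ranks the predictors by the absolute size of their LD1 coefficients:

```r
# Rank predictors by the magnitude of their LD1 coefficients;
# tax and rad should appear near the top of the list
ld1 <- lda.fit$scaling[, "LD1"]
round(sort(abs(ld1), decreasing = TRUE), 3)
```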

Super Bonus

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color = train$crime)

Dimensionality reduction techniques

Preparing environment

date()
## [1] "Tue Dec 12 09:30:20 2023"
library(tidyr)
library(dplyr)
library(corrplot)
library(GGally)
library(tibble)
library(ggplot2)
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.3.2

Reading in data

human <- read.csv("C:/Users/nicol/Downloads/PHD_302/IODS-project/data/human.csv")

Task 1 - Moving the countries to rownames

human_ <- column_to_rownames(human, "Country")

ggpairs(human_[-1] , 
        lower = list(continuous = wrap("points", color = "turquoise", alpha = 0.5), 
                     combo = wrap("box", color = "orange", alpha = 0.3), 
                     discrete = wrap("facetbar", color = "yellow", alpha = 0.3) ), 
        diag = list(continuous = wrap("densityDiag",  color = "red", alpha = 0.5) ))

There are quite a number of pairwise correlations that seem quite strong. GNI, Mat.Mor and Ado.Birth seem to skew to the right. I would guess that they also correlate with one another quite strongly.

corrplot(cor(human_[-1]))

Here I can see that Mat.Mor and Ado.Birth correlate positively with each other, while Mat.Mor and Life.Exp correlate negatively. This makes sense: the younger the person giving birth, the higher the maternal mortality and the lower the life expectancy.

Expected years of schooling (Edu.Exp) correlates positively with Edu2.FM and Life.Exp, and somewhat with GNI. This makes sense to me, as you would expect that the better the educational opportunity and availability, the better the other life parameters would be.

Edu2.FM seems to correlate negatively with Mat.Mor and Ado.Birth, which would indicate that the lower the educational level of women, the more likely they are to give birth young and succumb to maternal mortality.
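These impressions can be checked programmatically. A small sketch (assuming the `human_` data frame created above) pulls out the variable pair with the single strongest correlation:

```r
cor_m <- cor(human_[-1])
# keep only one triangle of the matrix so each pair appears once
# and the diagonal of 1s is excluded
cor_m[lower.tri(cor_m, diag = TRUE)] <- NA
# the variable pair with the strongest correlation (either sign)
which(abs(cor_m) == max(abs(cor_m), na.rm = TRUE), arr.ind = TRUE)
```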

Task 3 - PCA of non-standardized data

pca_human <- prcomp(human_[-1])

s <- summary(pca_human)
round(1*s$importance[2, ], digits = 2)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
##   1   0   0   0   0   0   0   0
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Here it seems GNI is responsible for nearly 100% of the variation, which looks too good to be true, and is in fact misleading. It is a good reminder to myself that scale matters, so next we will standardize the variables.
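The scale problem is easy to confirm: unstandardized PCA is driven by variance, and GNI's spread dwarfs that of the other variables. A quick check, assuming the same `human_` object:

```r
# GNI is measured in dollars while the other variables are
# ratios, percentages or years, so its variance dominates the PCA
round(sapply(human_[-1], sd), 2)
```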

PCA on standardized data

pca_human_sc <- prcomp(scale(human_))
s_sc <- summary(pca_human_sc)
pca_pr_z <- round(100*s_sc$importance[2, ], digits = 2)
pca_pr_z
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9 
## 57.61 14.53  8.56  7.06  4.87  3.34  2.40  1.21  0.43
pc_lab <- paste0(names(pca_pr_z), " (", pca_pr_z, "%)")

biplot(pca_human_sc, cex = c(0.4, 0.6), col = c("grey40", "deeppink2"),xlab=pc_lab[1], 
       ylab=pc_lab[2])

Here I can see that two principal components make up roughly 70% of the variation. PC1 correlates negatively with Edu.Exp, Life.Exp and Edu2.FM, while correlating positively with Mat.Mor and Ado.Birth. It makes sense that these variables cluster together, given the correlation plots we made first.

PC2 seems to positively correlate with Labo.FM and Parli.F.
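This reading can be verified from the loadings themselves, a sketch assuming the `pca_human_sc` object fitted above:

```r
# Signs and magnitudes of the loadings on the first two PCs;
# Labo.FM and Parli.F should load mainly on PC2
round(pca_human_sc$rotation[, 1:2], 2)
```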

Task 5 - Tea Time

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
view(tea)
keep_columns <- c("sex", "feminine", "effect.on.health", "healthy", "relaxing", "sugar", "exciting", "slimming", "sophisticated", "frequency")

tea_time <- select(tea, all_of(keep_columns))

pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + geom_bar() + facet_wrap("name", scales = "free")


Just looking at some variables I found interesting. There are more females who drink tea and find it relaxing and healthy.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = TRUE)

# summary of the model
summary.MCA(mca)
## 
## Call:
## MCA(X = tea_time, graph = TRUE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.181   0.144   0.131   0.121   0.108   0.099   0.092
## % of var.             15.094  12.002  10.900  10.117   8.980   8.220   7.633
## Cumulative % of var.  15.094  27.096  37.996  48.113  57.093  65.313  72.946
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.084   0.074   0.060   0.054   0.053
## % of var.              7.017   6.173   4.973   4.491   4.400
## Cumulative % of var.  79.963  86.136  91.109  95.600 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   |  0.562  0.581  0.284 | -0.663  1.017  0.394 |  0.223
## 2                   |  0.215  0.085  0.041 | -0.321  0.239  0.092 |  0.726
## 3                   | -0.283  0.147  0.095 | -0.575  0.765  0.394 |  0.136
## 4                   |  0.279  0.144  0.098 | -0.534  0.661  0.359 | -0.338
## 5                   |  0.245  0.111  0.054 | -0.416  0.401  0.157 |  0.003
## 6                   |  0.261  0.126  0.069 | -0.800  1.482  0.643 |  0.102
## 7                   |  0.302  0.168  0.058 | -0.497  0.572  0.158 |  0.169
## 8                   | -0.487  0.436  0.210 |  0.174  0.070  0.027 | -0.450
## 9                   |  0.088  0.014  0.009 | -0.135  0.042  0.020 | -0.164
## 10                  |  0.014  0.000  0.000 | -0.698  1.128  0.532 | -0.011
##                        ctr   cos2  
## 1                    0.127  0.045 |
## 2                    1.342  0.470 |
## 3                    0.047  0.022 |
## 4                    0.292  0.144 |
## 5                    0.000  0.000 |
## 6                    0.027  0.011 |
## 7                    0.073  0.018 |
## 8                    0.516  0.179 |
## 9                    0.069  0.030 |
## 10                   0.000  0.000 |
## 
## Categories (the 10 first)
##                         Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## F                   |  -0.513   8.612   0.384 -10.710 |   0.190   1.492   0.053
## M                   |   0.748  12.565   0.384  10.710 |  -0.278   2.177   0.053
## feminine            |  -0.708  11.907   0.378 -10.636 |   0.459   6.299   0.159
## Not.feminine        |   0.534   8.982   0.378  10.636 |  -0.346   4.752   0.159
## effect on health    |   0.547   3.629   0.084   5.020 |   0.975  14.512   0.268
## No.effect on health |  -0.154   1.024   0.084  -5.020 |  -0.275   4.093   0.268
## healthy             |  -0.296   3.383   0.204  -7.815 |  -0.321   5.008   0.240
## Not.healthy         |   0.690   7.893   0.204   7.815 |   0.749  11.686   0.240
## No.relaxing         |   0.332   2.297   0.067   4.468 |   0.362   3.430   0.079
## relaxing            |  -0.201   1.388   0.067  -4.468 |  -0.219   2.073   0.079
##                      v.test     Dim.3     ctr    cos2  v.test  
## F                     3.975 |   0.216   2.118   0.068   4.513 |
## M                    -3.975 |  -0.315   3.090   0.068  -4.513 |
## feminine              6.898 |  -0.019   0.012   0.000  -0.288 |
## Not.feminine         -6.898 |   0.014   0.009   0.000   0.288 |
## effect on health      8.951 |   0.040   0.027   0.000   0.366 |
## No.effect on health  -8.951 |  -0.011   0.008   0.000  -0.366 |
## healthy              -8.479 |  -0.015   0.013   0.001  -0.406 |
## Not.healthy           8.479 |   0.036   0.030   0.001   0.406 |
## No.relaxing           4.868 |   0.888  22.697   0.476  11.933 |
## relaxing             -4.868 |  -0.536  13.715   0.476 -11.933 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## sex                 | 0.384 0.053 0.068 |
## feminine            | 0.378 0.159 0.000 |
## effect.on.health    | 0.084 0.268 0.000 |
## healthy             | 0.204 0.240 0.001 |
## relaxing            | 0.067 0.079 0.476 |
## sugar               | 0.139 0.001 0.244 |
## exciting            | 0.066 0.140 0.021 |
## slimming            | 0.107 0.028 0.001 |
## sophisticated       | 0.091 0.232 0.074 |
## frequency           | 0.291 0.239 0.422 |

I have two dimensions that explain about 27% of the variation. Dim 1 seems to correlate with sex/feminine, and Dim 2 seems to sort around the perceived health aspects of tea drinking.
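FactoMineR stores the category contributions in the fitted object, so this interpretation can be checked directly (a sketch reusing the `mca` object above):

```r
# Contribution (%) of each category to the first two dimensions;
# the sex/feminine categories are expected to dominate Dim 1
round(mca$var$contrib[, 1:2], 2)
```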

# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali", palette = c("blue", "turquoise", "red"))


Here I can see that Dim 1 correlates positively with Not.feminine and M (male), while Dim 2 correlates with both effect on health and Not.healthy, which seems contradictory. So I suppose some additional pruning or combining of variables is needed; the inferences I could make from this analysis would be weak.


Assignment 6: Analysis of Longitudinal Data

Preparing the environment

library(tidyverse)
library(dplyr)
library(corrplot)
library(ggplot2)
library(lme4)
## Warning: package 'lme4' was built under R version 4.3.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.3.2
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:reshape':
## 
##     expand
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(cowplot)
## Warning: package 'cowplot' was built under R version 4.3.2
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:reshape':
## 
##     stamp
## The following object is masked from 'package:lubridate':
## 
##     stamp

Part 1: RATS data

Loading the data

RATS <- read.csv("C:/Users/nicol/Downloads/PHD_302/IODS-project/rats.csv")

RATSL <- read.csv("C:/Users/nicol/Downloads/PHD_302/IODS-project/ratsl.csv")

glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …

Making Group and ID factors

RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)

glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …

Plotting the data

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

Here I can see that group 1 started at a lower average weight than the other groups. It also seems that group 1 gained less weight over time than the other two groups.
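The starting-weight gap can be quantified with a small summary, assuming the `RATSL` data frame and the tidyverse loaded above:

```r
# Mean weight per group at the first measurement (Time == 1);
# group 1 should sit well below groups 2 and 3
RATSL %>%
  filter(Time == 1) %>%
  group_by(Group) %>%
  summarise(start_weight = mean(Weight))
```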

Standardizing the weight

RATSLS <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight-mean(Weight))/sd(Weight)) %>%
  ungroup()

Plotting the standardized data

ggplot(RATSLS, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:16, times=6)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

Here you can see more clearly what I saw in the non-standardized data, as well as an outlier in group 2.

Summarize the Rat Weights by mean and standard error

RATSLSsum <- RATSLS %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), n = n(), se = sd(Weight)/sqrt(n)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.

Plotting the summarized data

ggplot(RATSLSsum, aes(x = Time, y = mean)) +
  geom_line(aes(color = Group)) +
  geom_ribbon(aes(ymin = mean - se, ymax = mean + se, fill = Group), alpha = 0.2) +
  scale_y_continuous(name = "Weight")

Outlier Hunting

# Create a summary data by diet group and rat ID with mean as the summary variable (ignoring baseline Time = 1)
RATSL8S <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Draw a boxplot of the mean versus treatment
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight)")

head(RATSL8S, n=16)
## # A tibble: 16 × 3
##    Group ID     mean
##    <fct> <fct> <dbl>
##  1 1     1      263.
##  2 1     2      239.
##  3 1     3      262.
##  4 1     4      267.
##  5 1     5      271.
##  6 1     6      276.
##  7 1     7      275.
##  8 1     8      268.
##  9 2     9      444.
## 10 2     10     458.
## 11 2     11     456.
## 12 2     12     594 
## 13 3     13     495.
## 14 3     14     536.
## 15 3     15     542.
## 16 3     16     536.
From here I can see that we have one outlier above 550 in group 2, one below 250 in group 1, and one in group 3 at 495.2.
#removing outliers
RATSL8S1 <- RATSL8S %>%
  filter(mean<550 & mean>250 & mean!=495.2)

ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(weight)")

T-tests

exclude group 1

t.test(mean ~ Group, data = filter(RATSL8S1,Group!=1), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -18.235, df = 4, p-value = 5.32e-05
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -98.94088 -72.79246
## sample estimates:
## mean in group 2 mean in group 3 
##        452.4000        538.2667

exclude group 2

t.test(mean ~ Group, data = filter(RATSL8S1,Group!=2), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -77.715, df = 8, p-value = 8.377e-13
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -277.5065 -261.5125
## sample estimates:
## mean in group 1 mean in group 3 
##        268.7571        538.2667

exclude group 3

t.test(mean ~ Group, data = filter(RATSL8S1,Group!=3), var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2 
##        268.7571        452.4000
All the t-test comparisons reject the null hypothesis: the difference in means between each pair of groups is not zero. There is a statistically significant difference between every pair of groups.
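As an alternative to three separate `t.test()` calls, the same comparisons can be made in one step with a multiplicity correction (a sketch using the outlier-filtered `RATSL8S1` data from above):

```r
# All pairwise comparisons of mean weight between groups,
# with Bonferroni-adjusted p-values and unpooled SDs
pairwise.t.test(RATSL8S1$mean, RATSL8S1$Group,
                p.adjust.method = "bonferroni", pool.sd = FALSE)
```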

Analysis of Variance (ANOVA)

Create a baseline

RATSL8S2 <- RATSL8S %>%
  mutate(baseline = RATS$WD1)

Filter out the outliers like before

RATSL8S2 <- RATSL8S2 %>%
  filter(mean<550 & mean>250 & mean!=495.2)

Fit a model and perform ANOVA

fit <- lm(mean ~ baseline + Group, data = RATSL8S2)

anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## baseline   1 174926  174926 6630.941 3.216e-14 ***
## Group      2   2065    1033   39.147 3.628e-05 ***
## Residuals  9    237      26                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: both baseline weight and group had a significant effect on the mean weight. The heavier a rat was at the beginning, the heavier it tended to be at the end, which makes sense: a rat that starts heavier and keeps eating a calorie-dense diet will stay heavier. The diet group also affected the weight gain.

Part 2: BPRS data

Load the data

BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)

BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

BPRSL <-  pivot_longer(BPRS, cols = -c(treatment, subject),
                       names_to = "weeks", values_to = "bprs") %>%
  arrange(weeks) 

BPRSL <-  BPRSL %>% 
  mutate(week = as.integer(substr(weeks,5,5)))

str(BPRSL)
## tibble [360 × 5] (S3: tbl_df/tbl/data.frame)
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr [1:360] "week0" "week0" "week0" "week0" ...
##  $ bprs     : int [1:360] 42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int [1:360] 0 0 0 0 0 0 0 0 0 0 ...
Good, now I have my data in long form with factor variables.

Plot first, ask questions later

ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(col = subject), alpha = 0.5) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Week") + 
  scale_y_continuous(name = "Score") +
  theme(legend.position = "none")

I can’t really see a pattern or correlation here so far.

Regression Modelling

Linear Model

BPRS_lm <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRS_lm)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
Here it seems that week is a stronger predictor of the bprs scores than treatment; the scores decline by about 2.3 points per week.
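To put an uncertainty range on these estimates, confidence intervals for the coefficients can be computed from the fitted model (assuming the `BPRS_lm` object above):

```r
# 95% confidence intervals for the coefficients;
# the treatment2 interval should straddle zero
confint(BPRS_lm)
```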

Random Intercept Model

BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Random slope and random intercept model

BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

ANOVA of the two models

anova(BPRS_ref2, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref2    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing the two models, the random slope and intercept model performs significantly better (p = 0.026).

Adding Interaction

BPRS_ref3 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)

summary(BPRS_ref3)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840

ANOVA of the models with and without the interaction

anova(BPRS_ref3, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref2: bprs ~ week + treatment + (week | subject)
## BPRS_ref3: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref2    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref3    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction model is not significantly better (p = 0.075), so the extra term is not clearly justified.

Plotting Fitted Scores

Fitted <- fitted(BPRS_ref3)
BPRSL$Fitted <- Fitted

ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
  geom_line(aes(col = subject)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Week") +
  scale_y_continuous(name = "Fitted BPRS") +
  theme(legend.position = "none")

Observed vs Fitted

Fitted_plot<- ggplot(BPRSL, aes(x = week, y = Fitted, group = interaction(treatment, subject))) +
  geom_line(aes(linetype = treatment, colour=treatment)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "Fitted BPRS") +
  theme(legend.position = "top")  +
  ggtitle("Fitted")

# Plot the RATSL data
Observed_plot <- ggplot(BPRSL, aes(x = week, y = bprs, group = interaction(treatment, subject)))+
  geom_line(aes(linetype = treatment, colour=treatment))+
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 60, 20))+
  scale_y_continuous(name = "BPRS")+
  theme(legend.position = "top")+
  ggtitle("Observed")

plot_grid(Observed_plot, Fitted_plot, labels = "AUTO") 

Making sense out of madness: the general trend is for the BPRS scores to go down regardless of treatment, which was difficult to see from the observed data.

Complete. Happy Holidays :)